Authors: Sydney Holder, Chris Mathew, Nicholas Sager

In [1]:
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
# Required Libraries
library(tidyverse)
library(knitr)
library(kableExtra)
library(ggthemes)
library(caret)
# library(janitor)
# library(doParallel)
#library(e1071)
#library(class)
library(aplore3)

Introduction / Objective Summary

In this project we use logistic regression and other models to analyze the glow_bonemed dataset from the aplore3 R package. The objective is to analyze the risk factors for fracture in women with osteoporosis and to build a model that predicts fracture risk. For the first objective, we use a simple logistic regression model to explore how each factor is associated with fracture risk; the focus of this model is interpretability. For the second objective, we develop several more complex models to predict the risk of fracture and compare their performance.
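As a minimal sketch of the first objective (using a single predictor purely for illustration, not the final model), a logistic regression can be fit and its coefficients exponentiated into odds ratios:

```r
# Illustrative only: a one-predictor logistic regression whose
# coefficients are exponentiated for interpretation.
library(aplore3)

fit <- glm(fracture ~ priorfrac, data = glow_bonemed, family = binomial)
# Odds ratios with 95% profile CIs; the intercept row is the
# baseline odds of fracture, not an odds ratio.
exp(cbind(OR = coef(fit), confint(fit)))
```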

Data / Processing Summary

The data for this project come from a companion package to the textbook "Applied Logistic Regression (3rd Ed.)" by Hosmer, Lemeshow, and Sturdivant. The dataset contains 500 observations of 18 variables from a study of fractures in women with osteoporosis. The variables fall largely into three categories: administrative identifiers, demographic and clinical data, and bone medication status. The response variable, fracture, indicates whether the patient experienced a fracture during the first year of the study. The individual variables are described in the exploratory data analysis section. Because we know little about the study design or the data collection process, we treat the study as observational and use odds ratio interpretations throughout.

There are no missing values or anomalous data present, which is expected for an example data set from a textbook. No data cleaning or processing was required.

In [2]:
#Import Data
bone <- glow_bonemed

head(bone)
str(bone)
names(bone)

# Encode fracture as 0/1
bone$fracture.num <- ifelse(bone$fracture == "Yes", 1, 0)
A data.frame: 6 × 18

  sub_id site_id phy_id priorfrac age weight height      bmi premeno momfrac armassist smoke raterisk fracscore fracture bonemed bonemed_fu bonetreat
1      1       1     14        No  62   70.3    158 28.16055      No      No        No    No     Same         1       No      No         No        No
2      2       4    284        No  65   87.1    160 34.02344      No      No        No    No     Same         2       No      No         No        No
3      3       6    305       Yes  88   50.8    157 20.60936      No     Yes       Yes    No     Less        11       No      No         No        No
4      4       6    309        No  82   62.1    160 24.25781      No      No        No    No     Less         5       No      No         No        No
5      5       1     37        No  61   68.0    152 29.43213      No      No        No    No     Same         1       No      No         No        No
6      6       5    299       Yes  67   68.0    161 26.23356      No      No        No   Yes     Same         4       No      No         No        No
'data.frame':	500 obs. of  18 variables:
 $ sub_id    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ site_id   : int  1 4 6 6 1 5 5 1 1 4 ...
 $ phy_id    : int  14 284 305 309 37 299 302 36 8 282 ...
 $ priorfrac : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 2 2 1 ...
 $ age       : int  62 65 88 82 61 67 84 82 86 58 ...
 $ weight    : num  70.3 87.1 50.8 62.1 68 68 50.8 40.8 62.6 63.5 ...
 $ height    : int  158 160 157 160 152 161 150 153 156 166 ...
 $ bmi       : num  28.2 34 20.6 24.3 29.4 ...
 $ premeno   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ momfrac   : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
 $ armassist : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
 $ smoke     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
 $ raterisk  : Factor w/ 3 levels "Less","Same",..: 2 2 1 1 2 2 1 2 2 1 ...
 $ fracscore : int  1 2 11 5 1 4 6 7 7 0 ...
 $ fracture  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ bonemed   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
 $ bonemed_fu: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
 $ bonetreat : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
  1. 'sub_id'
  2. 'site_id'
  3. 'phy_id'
  4. 'priorfrac'
  5. 'age'
  6. 'weight'
  7. 'height'
  8. 'bmi'
  9. 'premeno'
  10. 'momfrac'
  11. 'armassist'
  12. 'smoke'
  13. 'raterisk'
  14. 'fracscore'
  15. 'fracture'
  16. 'bonemed'
  17. 'bonemed_fu'
  18. 'bonetreat'

NAs

In [3]:
# Check for NA Values
bone %>%
    summarise(across(everything(), ~sum(is.na(.)))) %>% 
    gather(key = "Column", value = "NA_Count", -1) #%>%
    # filter(NA_Count > 0) %>%
    # ggplot(aes(x = reorder(Column, NA_Count), y = NA_Count)) +
    # geom_col() +
    # coord_flip() +
    # theme_gdocs() +
    # labs(title = "Number of NA's by Column", x = "Column", y = "NA Count")
A data.frame: 18 × 3

sub_id  Column        NA_Count
     0  site_id              0
     0  phy_id               0
     0  priorfrac            0
     0  age                  0
     0  weight               0
     0  height               0
     0  bmi                  0
     0  premeno              0
     0  momfrac              0
     0  armassist            0
     0  smoke                0
     0  raterisk             0
     0  fracscore            0
     0  fracture             0
     0  bonemed              0
     0  bonemed_fu           0
     0  bonetreat            0
     0  fracture.num         0

There are no NA values in the data set.

Exploratory Data Analysis

We will explore each variable against the response, fracture, to see if there are any obvious relationships.

Fracture Prevalence:

In [4]:
# Fracture Prevalence
bone %>%
    group_by(fracture) %>%
    summarise(Count = n()) %>%
    mutate(Percent = round(Count / sum(Count), 3)) %>%
    rename(Fracture = fracture) %>%
    kable("html", caption = "Fracture Prevalence") %>%
    kable_styling("striped", full_width = F)

bone %>%
    ggplot(aes(x = fracture, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture Prevalence", x = "Fracture", y = "Count")
Fracture Prevalence

Fracture  Count  Percent
No          375     0.75
Yes         125     0.25
[Figure: Fracture Prevalence bar chart]

The data set is imbalanced: 375 patients (75%) had no fracture and 125 patients (25%) had a fracture.
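One practical consequence of this imbalance, sketched below: a no-information classifier that always predicts "No" is already 75% accurate, so raw accuracy is a weak metric for the predictive models in objective two.

```r
# The majority-class ("No") proportion is the accuracy of a trivial
# classifier that never predicts a fracture.
library(aplore3)

prop.table(table(glow_bonemed$fracture))
# Sensitivity/specificity or AUC, and resampling options such as
# caret's trainControl(sampling = "down"), are better suited here.
```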

Sub ID:

In [5]:
summary(bone$sub_id)

# Fracture by sub_id
bone %>%
    ggplot(aes(x = sub_id, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by sub_id", x = "sub_id", y = "Count", fill = "Fracture")
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   125.8   250.5   250.5   375.2   500.0 
[Figure: Fracture by sub_id bar chart]

The variable sub_id is a unique identifier for each patient. This variable will not be used in the model.

Site ID:

In [6]:
summary(bone$site_id)

# Fracture by site_id
bone %>%
    ggplot(aes(x = site_id, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by site_id", x = "site_id", y = "Count", fill = "Fracture")

# Fracture by site_id as a proportion
bone %>%
    group_by(site_id) %>%
    summarise(Fracture = sum(fracture.num) / n()) %>%
    ggplot(aes(x = reorder(site_id, Fracture), y = Fracture, fill = factor(site_id))) +
    geom_col() +
    coord_flip() +
    theme_gdocs() +
    labs(
        title = "Fracture by site_id",
        x = "site_id",
        y = "Fracture Rate",
        fill = "Site ID"
    )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   3.000   3.436   5.000   6.000 
[Figure: Fracture by site_id bar chart]
[Figure: Fracture rate by site_id]

The variable site_id identifies the study site. Although fracture rates differ somewhat across sites, we don't believe that site_id contains useful predictive information. This variable will not be used in the model.

Physician ID:

In [7]:
# Investigate how many patients each physician has
# And whether any physicians have a high fracture rate
bone %>%
    group_by(phy_id) %>%
    summarise(Patients = n(), Fracture = sum(fracture.num) / n()) %>%
    mutate(Fracture = round(Fracture, 3)) %>%
    arrange(desc(Patients)) %>%
    filter(Patients > 5) %>%
    rename(`Physician ID` = phy_id) %>%
    kable("html", caption = "Fracture Rate by Physician") %>%
    kable_styling("striped", full_width = F)

# Fracture by physician_id
bone %>%
    ggplot(aes(x = phy_id, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by phy_id", x = "physician_id", y = "Count", fill = "Fracture")
Fracture Rate by Physician

Physician ID  Patients  Fracture
         301        15     0.400
         287        14     0.429
         294        14     0.214
         309        13     0.308
          37        11     0.273
         296        11     0.636
         295        10     0.200
         313        10     0.200
          27         9     0.000
          36         8     0.375
          64         8     0.750
         289         8     0.250
         305         8     0.250
          50         7     0.429
         282         7     0.286
         284         7     0.000
         299         7     0.000
         317         7     0.286
         325         7     0.286
          39         6     0.000
          63         6     0.000
          79         6     0.500
         194         6     0.000
         281         6     0.333
         288         6     0.333
         300         6     0.500
         304         6     0.333
         315         6     0.167
[Figure: Fracture by phy_id bar chart]

Fracture rates vary across physicians, but each physician has a small number of patients, so high variance in these rates is expected. This variable will not be used in the model.

Prior Fractures:

In [8]:
summary(factor(bone$priorfrac))

# 2x2 Table of prior fracture and fracture
bone %>%
    group_by(priorfrac, fracture) %>%
    summarise(Count = n()) %>%
    pivot_wider(names_from = fracture, values_from = Count) %>%
    mutate(Total = `No` + `Yes`) %>%
    mutate(`No` = round(`No` / Total, 3)) %>%
    mutate(`Yes` = round(`Yes` / Total, 3)) %>%
    select(-Total) %>%
    rename(`Prior Fracture` = priorfrac) %>%
    kable("html", caption = "Fracture by Prior Fracture") %>%
    kable_styling("striped", full_width = F)

# Fracture by prior_fracture
bone %>%
    ggplot(aes(x = priorfrac, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by Prior Fracture", x = "prior_fracture", y = "Count", fill = "Fracture")

# Fisher's exact test for difference
fisher.test(bone$fracture, bone$priorfrac)
 No Yes 
374 126 
Fracture by Prior Fracture

Prior Fracture     No    Yes
No              0.805  0.195
Yes             0.587  0.413
	Fisher's Exact Test for Count Data

data:  bone$fracture and bone$priorfrac
p-value = 2.683e-06
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.822147 4.583027
sample estimates:
odds ratio 
  2.890213 
[Figure: Fracture by Prior Fracture bar chart]

Approximately one quarter of participants had a prior fracture, similar to the fracture rate observed during the study. Patients with a prior fracture were roughly twice as likely to fracture during the study (41.3% vs. 19.5%), with an estimated odds ratio of 2.89 (p < 0.001, Fisher's exact test).
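The "twice as likely" reading can be checked directly from the 2x2 table; a quick sketch using the same data:

```r
# Compare fracture rates with vs. without a prior fracture.
library(aplore3)

tab <- table(glow_bonemed$priorfrac, glow_bonemed$fracture)
rates <- prop.table(tab, margin = 1)[, "Yes"]  # row-wise fracture rates
rates                                # ~0.195 (no prior) vs ~0.413 (prior)
unname(rates["Yes"] / rates["No"])   # relative risk, roughly 2.1
```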

Age:

In [9]:
summary(bone$age)
bone %>%
    ggplot(aes(x = age, fill = fracture)) +
    # geom_histogram(bins = 20) +
    geom_boxplot() +
    theme_gdocs() +
    labs(title = "Age Distribution", x = "age")

# Fracture by age
bone %>%
    ggplot(aes(x = age, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by Age", x = "age", y = "Count", fill = "Fracture")

# Loess curve of fracture by age
bone %>%
    ggplot(aes(x = age, y = fracture.num)) +
    geom_point() +
    geom_smooth(method = "loess", span = 1) +
    theme_gdocs() +
    labs(title = "Fracture by Age", x = "age", y = "Fracture")

# T-test for difference in age between fracture and no fracture
t.test(age ~ fracture, data = bone)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  55.00   61.00   67.00   68.56   76.00   90.00 
[Figure: Age Distribution boxplot]
[Figure: Fracture by Age loess curve]
	Welch Two Sample t-test

data:  age by fracture
t = -4.6184, df = 203.81, p-value = 6.844e-06
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -6.145258 -2.468075
sample estimates:
 mean in group No mean in group Yes 
         67.48533          71.79200 
[Figure: Fracture by Age bar chart]

Patients with fractures are significantly older on average than those without (p < 0.001, Welch two sample t-test). Fracture counts appear fairly even across ages, but because fewer patients at the older end of the age range were enrolled in the study, this implies a rising fracture rate with age. We expect that the likelihood of a fracture increases with age and that this predictor will be important in the model.
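To make the age trend concrete, fracture rates can be tabulated by age band. This is a sketch in the same dplyr style used above; the band edges are arbitrary:

```r
# Fracture rate within ten-year age bands (ages run 55 to 90).
library(aplore3)
library(dplyr)

glow_bonemed %>%
    mutate(age_band = cut(age, breaks = c(54, 64, 74, 84, 94))) %>%
    group_by(age_band) %>%
    summarise(n = n(), fracture_rate = mean(fracture == "Yes"))
```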

Weight:

In [10]:
summary(bone$weight)

# Weight distribution by fracture
bone %>%
    ggplot(aes(x = weight, fill = fracture)) +
    geom_boxplot() +
    theme_gdocs() +
    labs(title = "Weight Distribution", x = "weight", fill = "Fracture")

# Fracture by weight
bone %>%
    ggplot(aes(x = weight, fill = fracture)) +
    geom_histogram(bins = 20) +
    theme_gdocs() +
    labs(title = "Fracture by Weight", x = "weight", y = "Count", fill = "Fracture")

# Loess curve of fracture by weight
bone %>%
    ggplot(aes(x = weight, y = fracture.num)) +
    geom_point() +
    geom_smooth(method = "loess", span = 1) +
    theme_gdocs() +
    labs(title = "Fracture by Weight", x = "weight", y = "Fracture")

# T-test for difference in weight between fracture and no fracture
t.test(weight ~ fracture, data = bone)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  39.90   59.90   68.00   71.82   81.30  127.00 
[Figure: Weight Distribution boxplot]
[Figure: Fracture by Weight loess curve]
	Welch Two Sample t-test

data:  weight by fracture
t = 0.83719, df = 225.61, p-value = 0.4034
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -1.861305  4.611171
sample estimates:
 mean in group No mean in group Yes 
         72.16693          70.79200 
[Figure: Fracture by Weight histogram]

Weight does not appear to have an effect on fracture rate. The distribution of weight is similar between patients with and without fractures. We expect that this predictor will not be important in the model.

Height:

In [11]:
summary(bone$height)

# Height distribution by fracture
bone %>%
    ggplot(aes(x = height, fill = fracture)) +
    geom_boxplot() +
    theme_gdocs() +
    labs(title = "Height Distribution", x = "height", fill = "Fracture")

# Fracture by height
bone %>%
    ggplot(aes(x = height, fill = fracture)) +
    geom_histogram(bins = 20) +
    theme_gdocs() +
    labs(title = "Fracture by Height", x = "height", y = "Count", fill = "Fracture")

# Loess curve of fracture by height
bone %>%
    ggplot(aes(x = height, y = fracture.num)) +
    geom_point() +
    geom_smooth(method = "loess", span = 1) +
    theme_gdocs() +
    labs(title = "Fracture by Height", x = "height", y = "Fracture")

# T-test for difference in height between fracture and no fracture
t.test(height ~ fracture, data = bone)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  134.0   157.0   161.5   161.4   165.0   199.0 
[Figure: Height Distribution boxplot]
[Figure: Fracture by Height loess curve]
	Welch Two Sample t-test

data:  height by fracture
t = 2.9031, df = 194.07, p-value = 0.004123
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 0.6412751 3.3587249
sample estimates:
 mean in group No mean in group Yes 
          161.864           159.864 
[Figure: Fracture by Height histogram]

Height does not appear to have a practically meaningful effect on fracture rate. The distribution of height is similar between patients with and without fractures. A two sample t-test finds a statistically significant difference in mean heights (about 2 cm, p = 0.004), but not a practically significant one. We expect that this predictor will not be important in the model.
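The statistical-vs-practical distinction can be made concrete by comparing the roughly 2 cm mean difference to the overall spread of heights (a quick sketch):

```r
# A ~2 cm mean difference between groups is small if the overall
# standard deviation of height is several times larger than that.
library(aplore3)

sd(glow_bonemed$height)  # compare this spread to the ~2 cm group gap
```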

BMI:

In [12]:
summary(bone$bmi)

# BMI distribution by fracture
bone %>%
    ggplot(aes(x = bmi, fill = fracture)) +
    geom_boxplot() +
    theme_gdocs() +
    labs(title = "BMI Distribution", x = "bmi", fill = "Fracture")

# Fracture by bmi
bone %>%
    ggplot(aes(x = bmi, fill = fracture)) +
    geom_histogram(bins = 20) +
    theme_gdocs() +
    labs(title = "Fracture by BMI", x = "bmi", y = "Count", fill = "Fracture")

# Loess curve of fracture by bmi
bone %>%
    ggplot(aes(x = bmi, y = fracture.num)) +
    geom_point() +
    geom_smooth(method = "loess", span = 1) +
    theme_gdocs() +
    labs(title = "Fracture by BMI", x = "bmi", y = "Fracture")

# T-test for difference in bmi between fracture and no fracture
t.test(bmi ~ fracture, data = bone)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  14.88   23.27   26.42   27.55   30.79   49.08 
[Figure: BMI Distribution boxplot]
[Figure: Fracture by BMI loess curve]
	Welch Two Sample t-test

data:  bmi by fracture
t = -0.33916, df = 217.86, p-value = 0.7348
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -1.4066993  0.9936373
sample estimates:
 mean in group No mean in group Yes 
         27.50140          27.70793 
[Figure: Fracture by BMI histogram]

BMI does not appear to have an effect on fracture rate (p = 0.73, Welch two sample t-test). We expect that this predictor will not be important in the model.

Early Menopause (<45):

In [13]:
summary(factor(bone$premeno))

# 2x2 Table of premenopausal and fracture
bone %>%
    group_by(premeno, fracture) %>%
    summarise(Count = n()) %>%
    pivot_wider(names_from = fracture, values_from = Count) %>%
    mutate(Total = `No` + `Yes`) %>%
    mutate(`No` = round(`No` / Total, 3)) %>%
    mutate(`Yes` = round(`Yes` / Total, 3)) %>%
    select(-Total) %>%
    rename(`Early Menopause` = premeno) %>%
    kable("html", caption = "Fracture by Early Menopause") %>%
    kable_styling("striped", full_width = F)

# Fracture by premeno
bone %>%
    ggplot(aes(x = premeno, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by Early Menopause", x = "Early Menopause", y = "Count", fill = "Fracture")

# Fisher's exact test for difference
fisher.test(bone$fracture, bone$premeno)
 No Yes 
403  97 
Fracture by Early Menopause

Early Menopause     No    Yes
No               0.752  0.248
Yes              0.742  0.258
	Fisher's Exact Test for Count Data

data:  bone$fracture and bone$premeno
p-value = 0.8962
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.6050303 1.7861989
sample estimates:
odds ratio 
  1.051975 
[Figure: Fracture by Early Menopause bar chart]

Early menopause does not appear to affect the observed fracture rate. Both the early and typical menopausal onset groups have a fracture rate of approximately 25%. We expect that this predictor will not be important in the model.

Maternal Hip Fracture:

In [14]:
summary(factor(bone$momfrac))

# 2x2 Table of maternal hip fracture and fracture
bone %>%
    group_by(momfrac, fracture) %>%
    summarise(Count = n()) %>%
    pivot_wider(names_from = fracture, values_from = Count) %>%
    mutate(Total = `No` + `Yes`) %>%
    mutate(`No` = round(`No` / Total, 3)) %>%
    mutate(`Yes` = round(`Yes` / Total, 3)) %>%
    select(-Total) %>%
    rename(`Maternal Hip Fracture` = momfrac) %>%
    kable("html", caption = "Fracture by Maternal Hip Fracture") %>%
    kable_styling("striped", full_width = F)

# Fracture by momfrac
bone %>%
    ggplot(aes(x = momfrac, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by Maternal Hip Fracture", x = "Maternal Hip Fracture", y = "Count", fill = "Fracture")

# Fisher's exact test for difference
fisher.test(bone$fracture, bone$momfrac)
 No Yes 
435  65 
Fracture by Maternal Hip Fracture

Maternal Hip Fracture     No    Yes
No                     0.768  0.232
Yes                    0.631  0.369
	Fisher's Exact Test for Count Data

data:  bone$fracture and bone$momfrac
p-value = 0.02127
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.062961 3.457099
sample estimates:
odds ratio 
  1.932961 
[Figure: Fracture by Maternal Hip Fracture bar chart]

A history of hip fracture in the subject's mother appears to be a meaningful predictor of fracture: it is associated with roughly 1.9 times the odds of observing a fracture (p = 0.02, Fisher's exact test) compared to women without a maternal hip fracture. We expect that this predictor will be important in the model.

Need Arms to Stand Up:

In [15]:
summary(factor(bone$armassist))

# 2x2 Table of arm assist and fracture
bone %>%
    group_by(armassist, fracture) %>%
    summarise(Count = n()) %>%
    pivot_wider(names_from = fracture, values_from = Count) %>%
    mutate(Total = `No` + `Yes`) %>%
    mutate(`No` = round(`No` / Total, 3)) %>%
    mutate(`Yes` = round(`Yes` / Total, 3)) %>%
    select(-Total) %>%
    rename(`Need Arms to Stand Up` = armassist) %>%
    kable("html", caption = "Fracture by Need Arms to Stand Up") %>%
    kable_styling("striped", full_width = F)

# Fracture by armassist
bone %>%
    ggplot(aes(x = armassist, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by Need Arms to Stand Up", x = "Need Arms to Stand Up", y = "Count", fill = "Fracture")

# Fisher's exact test for difference
fisher.test(bone$fracture, bone$armassist)
 No Yes 
312 188 
Fracture by Need Arms to Stand Up

Need Arms to Stand Up     No    Yes
No                     0.801  0.199
Yes                    0.665  0.335
	Fisher's Exact Test for Count Data

data:  bone$fracture and bone$armassist
p-value = 0.0009071
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.317783 3.129094
sample estimates:
odds ratio 
  2.029256 
[Figure: Fracture by Need Arms to Stand Up bar chart]

Requiring the use of arms to stand from a chair appears to be associated with an increased risk of fracture: a positive answer to this question is associated with roughly twice the odds of fracture (p < 0.001, Fisher's exact test). We expect that this predictor will be important in the model.

Smoking:

In [16]:
summary(factor(bone$smoke))

# 2x2 Table of smoking and fracture
bone %>%
    group_by(smoke, fracture) %>%
    summarise(Count = n()) %>%
    pivot_wider(names_from = fracture, values_from = Count) %>%
    mutate(Total = `No` + `Yes`) %>%
    mutate(`No` = round(`No` / Total, 3)) %>%
    mutate(`Yes` = round(`Yes` / Total, 3)) %>%
    select(-Total) %>%
    rename(`Smoking` = smoke) %>%
    kable("html", caption = "Fracture by Smoking") %>%
    kable_styling("striped", full_width = F)

# Fracture by smoke
bone %>%
    ggplot(aes(x = smoke, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by Smoking", x = "Smoking", y = "Count", fill = "Fracture")

# Fisher's exact test for difference
fisher.test(bone$fracture, bone$smoke)
 No Yes 
465  35 
Fracture by Smoking

Smoking     No    Yes
No       0.746  0.254
Yes      0.800  0.200
	Fisher's Exact Test for Count Data

data:  bone$fracture and bone$smoke
p-value = 0.5496
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.264191 1.782237
sample estimates:
odds ratio 
 0.7356144 
[Figure: Fracture by Smoking bar chart]

Smoking does not appear to be associated with a change in fracture rate (p = 0.55, Fisher's exact test). We expect that this predictor will not be important in the model.

Self-reported Risk:

In [17]:
summary(factor(bone$raterisk))

# Fracture by raterisk
bone %>%
    ggplot(aes(x = raterisk, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by Self-reported Risk", x = "Self-reported Risk", y = "Count", fill = "Fracture")

# Table of fracture prevalence by raterisk
bone %>%
    group_by(raterisk, fracture) %>%
    summarise(Count = n()) %>%
    pivot_wider(names_from = fracture, values_from = Count) %>%
    mutate(Total = `No` + `Yes`) %>%
    mutate(`No` = round(`No` / Total, 3)) %>%
    mutate(`Yes` = round(`Yes` / Total, 3)) %>%
    select(-Total) %>%
    rename(`Self-reported Risk` = raterisk) %>%
    kable("html", caption = "Fracture by Self-reported Risk") %>%
    kable_styling("striped", full_width = F)

# Fisher's exact test for difference
fisher.test(bone$fracture, bone$raterisk)

# Logistic regression for difference
glm(fracture ~ raterisk, data = bone, family = "binomial") %>%
    summary()
Less
167
Same
186
Greater
147
`summarise()` has grouped output by 'raterisk'. You can override using the
`.groups` argument.
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Fracture by Self-reported Risk</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Self-reported Risk </th>
   <th style="text-align:right;"> No </th>
   <th style="text-align:right;"> Yes </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Less </td>
   <td style="text-align:right;"> 0.832 </td>
   <td style="text-align:right;"> 0.168 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Same </td>
   <td style="text-align:right;"> 0.742 </td>
   <td style="text-align:right;"> 0.258 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Greater </td>
   <td style="text-align:right;"> 0.667 </td>
   <td style="text-align:right;"> 0.333 </td>
  </tr>
</tbody>
</table>
	Fisher's Exact Test for Count Data

data:  bone$fracture and bone$raterisk
p-value = 0.002859
alternative hypothesis: two.sided
Call:
glm(formula = fracture ~ raterisk, family = "binomial", data = bone)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.6023     0.2071  -7.735 1.03e-14 ***
rateriskSame      0.5462     0.2664   2.050   0.0404 *  
rateriskGreater   0.9091     0.2711   3.353   0.0008 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 562.34  on 499  degrees of freedom
Residual deviance: 550.58  on 497  degrees of freedom
AIC: 556.58

Number of Fisher Scoring iterations: 4
[Plot: Fracture by Self-reported Risk]

Self-reported risk of fracture appears to be strongly associated with fracture. Fracture rates increase with self-reported risk, and both the "Same" and "Greater" levels differ significantly from the reference level "Less" (p < 0.05 from the logistic regression). The observations are distributed roughly evenly across the three levels. We expect this predictor to be important in the model.
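A hedged sketch of the interpretation step above: exponentiating logistic-regression coefficients gives odds ratios against the reference level. The data here are synthetic; `risk` and `frac` are made-up stand-ins for `raterisk` and `fracture`, not the glow_bonemed values.

```r
# Synthetic illustration: odds ratios from a logistic fit on a 3-level factor.
set.seed(1)
risk <- factor(sample(c("Less", "Same", "Greater"), 500, replace = TRUE),
               levels = c("Less", "Same", "Greater"))
p <- c(Less = 0.17, Same = 0.26, Greater = 0.33)[as.character(risk)]
frac <- rbinom(500, 1, p)
fit <- glm(frac ~ risk, family = binomial)
# exp() of a logit coefficient is an odds ratio vs. the reference level "Less"
round(exp(cbind(OR = coef(fit), confint.default(fit))), 2)
```

The same transformation applied to the fitted `raterisk` coefficients (0.546 and 0.909) gives odds ratios of roughly 1.7 and 2.5 relative to "Less".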

Additionally, self-reported risk does not appear to be related to height or weight, but is influenced by the presence of a prior fracture:

In [18]:
ggplot(glow_bonemed, aes(y = weight, x = height, color = raterisk)) +
    geom_point() +
    theme_gdocs() +
    labs(title = "Self-reported Risk by Height and Weight", x = "Height", y = "Weight", color = "Self-reported Risk")

ggplot(glow_bonemed, aes(x = age, y = weight, color = raterisk)) +
    geom_point() +
    theme_gdocs() +
    labs(title = "Self-reported Risk by Age and Weight", x = "Age", y = "Weight", color = "Self-reported Risk")

ggplot(glow_bonemed, aes(priorfrac, fill = raterisk)) +
    geom_bar(position = "fill") +
    theme_gdocs() +
    labs(title = "Self-reported Risk by Prior Fracture", x = "Prior Fracture", y = "Count", fill = "Self-reported Risk")
[Plot: Self-reported Risk by Height and Weight]
[Plot: Self-reported Risk by Age and Weight]
[Plot: Self-reported Risk by Prior Fracture]

Fracture Risk Score (Composite Risk Score):

In [19]:
summary((bone$fracscore))

# Fracture by fracscore
bone %>%
    ggplot(aes(x = fracscore, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by Fracture Risk Score", x = "Fracture Risk Score", y = "Count", fill = "Fracture")

# Scatterplot using fracture.num and loess curve
bone %>%
    ggplot(aes(x = fracscore, y = fracture.num)) +
    geom_point(position = position_jitter(width = 0.3, height = 0.05)) +
    geom_smooth(method = "loess") +
    theme_gdocs() +
    labs(title = "Fracture Risk Score vs. Fracture", x = "Fracture Risk Score", y = "Fracture")

# Table of fracture prevalence by fracscore
bone %>%
    group_by(fracscore, fracture) %>%
    summarise(Count = n()) %>%
    pivot_wider(names_from = fracture, values_from = Count) %>%
    mutate(Total = `No` + `Yes`) %>%
    mutate(`No` = round(`No` / Total, 3)) %>%
    mutate(`Yes` = round(`Yes` / Total, 3)) %>%
    select(-Total) %>%
    rename(`Fracture Risk Score` = fracscore) %>%
    kable("html", caption = "Fracture by Fracture Risk Score") %>%
    kable_styling("striped", full_width = F)

# View data with fracscore of 10 and 11
bone %>%
    filter(fracscore == 10 | fracscore == 11) %>%
    kable("html", caption = "Data with Fracture Risk Score of 10 or 11") %>%
    kable_styling("striped", full_width = F)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.000   3.000   3.698   5.000  11.000 
`geom_smooth()` using formula = 'y ~ x'
[Plot: Fracture by Fracture Risk Score]
`summarise()` has grouped output by 'fracscore'. You can override using the
`.groups` argument.
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Fracture by Fracture Risk Score</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> Fracture Risk Score </th>
   <th style="text-align:right;"> No </th>
   <th style="text-align:right;"> Yes </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0.875 </td>
   <td style="text-align:right;"> 0.125 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.897 </td>
   <td style="text-align:right;"> 0.103 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 0.843 </td>
   <td style="text-align:right;"> 0.157 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 0.797 </td>
   <td style="text-align:right;"> 0.203 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 0.734 </td>
   <td style="text-align:right;"> 0.266 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 0.691 </td>
   <td style="text-align:right;"> 0.309 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 0.606 </td>
   <td style="text-align:right;"> 0.394 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 0.667 </td>
   <td style="text-align:right;"> 0.333 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 0.458 </td>
   <td style="text-align:right;"> 0.542 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 9 </td>
   <td style="text-align:right;"> 0.100 </td>
   <td style="text-align:right;"> 0.900 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
  </tr>
</tbody>
</table>
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Data with Fracture Risk Score of 10 or 11</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> sub_id </th>
   <th style="text-align:right;"> site_id </th>
   <th style="text-align:right;"> phy_id </th>
   <th style="text-align:left;"> priorfrac </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:right;"> weight </th>
   <th style="text-align:right;"> height </th>
   <th style="text-align:right;"> bmi </th>
   <th style="text-align:left;"> premeno </th>
   <th style="text-align:left;"> momfrac </th>
   <th style="text-align:left;"> armassist </th>
   <th style="text-align:left;"> smoke </th>
   <th style="text-align:left;"> raterisk </th>
   <th style="text-align:right;"> fracscore </th>
   <th style="text-align:left;"> fracture </th>
   <th style="text-align:left;"> bonemed </th>
   <th style="text-align:left;"> bonemed_fu </th>
   <th style="text-align:left;"> bonetreat </th>
   <th style="text-align:right;"> fracture.num </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 305 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 88 </td>
   <td style="text-align:right;"> 50.8 </td>
   <td style="text-align:right;"> 157 </td>
   <td style="text-align:right;"> 20.60936 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Less </td>
   <td style="text-align:right;"> 11 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 91 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 141 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 87 </td>
   <td style="text-align:right;"> 61.2 </td>
   <td style="text-align:right;"> 160 </td>
   <td style="text-align:right;"> 23.90625 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Greater </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 136 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 298 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 86 </td>
   <td style="text-align:right;"> 45.4 </td>
   <td style="text-align:right;"> 163 </td>
   <td style="text-align:right;"> 17.08758 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Greater </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 169 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 35 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 90 </td>
   <td style="text-align:right;"> 47.6 </td>
   <td style="text-align:right;"> 161 </td>
   <td style="text-align:right;"> 18.36349 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Less </td>
   <td style="text-align:right;"> 11 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 179 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 116 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 85 </td>
   <td style="text-align:right;"> 54.9 </td>
   <td style="text-align:right;"> 159 </td>
   <td style="text-align:right;"> 21.71591 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Greater </td>
   <td style="text-align:right;"> 11 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 328 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 59 </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 90 </td>
   <td style="text-align:right;"> 61.7 </td>
   <td style="text-align:right;"> 161 </td>
   <td style="text-align:right;"> 23.80309 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Same </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>
[Plot: Fracture Risk Score vs. Fracture (jittered points with loess fit)]

More information is required to determine whether fracscore is an independent measurement or a composite of the other variables, so we include it with caution. An increasing fracture risk score is associated with an increased rate of fracture: the loess fit shows an increasing trend, with the exception of levels 10 and 11. Those levels have a very small sample size (six observations) and contain no fractures; one can imagine that people in this category are extremely cautious or bedridden. We expect this predictor to be important in the model and will consider excluding levels 10 and 11.
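A brief sketch of why the jittered scatterplot pairs a 0/1 outcome with a loess smooth: the smooth estimates P(fracture) at each score. The data here are synthetic; `score` and `frac` are illustrative stand-ins, not the real fracscore.

```r
# Synthetic illustration: a loess smooth of a binary outcome approximates
# the conditional probability of the event at each predictor value.
set.seed(4)
score <- sample(0:9, 400, replace = TRUE)
frac <- rbinom(400, 1, 0.10 + 0.05 * score)   # true P(frac) rises with score
fit <- loess(frac ~ score)
round(predict(fit, data.frame(score = c(0, 5, 9))), 2)  # increasing trend
```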

Fracture Risk Score (Composite Risk Score) with levels 10 and 11 excluded:

In [20]:
# Fracture by fracscore
bone %>%
    filter(fracscore != 10 & fracscore != 11) %>%
    ggplot(aes(x = fracscore, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(title = "Fracture by Fracture Risk Score", x = "Fracture Risk Score", y = "Count", fill = "Fracture")

# Scatterplot using fracture.num and loess curve
bone %>%
    filter(fracscore != 10 & fracscore != 11) %>%
    ggplot(aes(x = fracscore, y = fracture.num)) +
    geom_point(position = position_jitter(width = 0.3, height = 0.05)) +
    geom_smooth(method = "loess") +
    theme_gdocs() +
    labs(title = "Fracture Risk Score vs. Fracture", x = "Fracture Risk Score", y = "Fracture")
`geom_smooth()` using formula = 'y ~ x'
[Plot: Fracture by Fracture Risk Score (levels 10 and 11 excluded)]
[Plot: Fracture Risk Score vs. Fracture (levels 10 and 11 excluded)]

With levels 10 and 11 excluded, fracture risk score shows a cleaner, monotonically increasing relationship with fracture and appears to be a better predictor.

Bone Medications at Enrollment:

In [21]:
summary(factor(bone$bonemed))

# Fracture by bonemed
bone %>%
    ggplot(aes(x = bonemed, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(
        title = "Fracture by Bone Medications at Enrollment",
        x = "Bone Medications at Enrollment",
        y = "Count",
        fill = "Fracture"
    )

# Table of fracture prevalence by bonemed
bone %>%
    group_by(bonemed, fracture) %>%
    summarise(Count = n()) %>%
    pivot_wider(names_from = fracture, values_from = Count) %>%
    mutate(Total = `No` + `Yes`) %>%
    mutate(`No` = round(`No` / Total, 3)) %>%
    mutate(`Yes` = round(`Yes` / Total, 3)) %>%
    select(-Total) %>%
    rename(`Bone Medications at Enrollment` = bonemed) %>%
    kable("html", caption = "Fracture by Bone Medications at Enrollment") %>%
    kable_styling("striped", full_width = F)

# Fisher's exact test for difference
fisher.test(bone$fracture, bone$bonemed)

# Checking this method for fun. Same result as Fisher, harder syntax
# library(epitools)
# epitab(bone$bonemed, bone$fracture, method = "oddsratio", oddsratio = "fisher", pvalue = "fisher.exact")
 No Yes 
371 129 
`summarise()` has grouped output by 'bonemed'. You can override using the
`.groups` argument.
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Fracture by Bone Medications at Enrollment</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Bone Medications at Enrollment </th>
   <th style="text-align:right;"> No </th>
   <th style="text-align:right;"> Yes </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> No </td>
   <td style="text-align:right;"> 0.787 </td>
   <td style="text-align:right;"> 0.213 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 0.643 </td>
   <td style="text-align:right;"> 0.357 </td>
  </tr>
</tbody>
</table>
	Fisher's Exact Test for Count Data

data:  bone$fracture and bone$bonemed
p-value = 0.002024
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.286032 3.238662
sample estimates:
odds ratio 
  2.045305 
[Plot: Fracture by Bone Medications at Enrollment]

Bone medication use at enrollment appears to be strongly associated with fracture. The odds of fracture are about 2.0 times greater for those taking bone medications, and the difference is statistically significant (p = 0.002 from Fisher's exact test). Since the study is observational, this association may reflect confounding by indication (higher-risk patients are more likely to be prescribed medication) rather than a harmful effect of the medication.
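For reference, the sample odds ratio can be computed directly from the 2×2 table. The counts below are reconstructed from the totals and proportions reported above, so treat them as approximate; `fisher.test()` reports the conditional MLE, which differs slightly from this unconditional estimate.

```r
# Sample odds ratio from the bonemed-by-fracture 2x2 table.
# Counts reconstructed from the reported totals (371 / 129) and proportions.
tab <- matrix(c(292, 83, 79, 46), nrow = 2,
              dimnames = list(bonemed = c("No", "Yes"),
                              fracture = c("No", "Yes")))
odds <- tab[, "Yes"] / tab[, "No"]   # odds of fracture within each group
unname(odds["Yes"] / odds["No"])     # sample OR, close to the Fisher estimate
```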

Bone Medications at Follow-up:

In [22]:
summary(factor(bone$bonemed_fu))

# Fracture by bonemed_fu
bone %>%
    ggplot(aes(x = bonemed_fu, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(
        title = "Fracture by Bone Medications at Follow-up",
        x = "Bone Medications at Follow-up",
        y = "Count",
        fill = "Fracture"
    )

# Table of fracture prevalence by bonemed_fu
bone %>%
    group_by(bonemed_fu, fracture) %>%
    summarise(Count = n()) %>%
    pivot_wider(names_from = fracture, values_from = Count) %>%
    mutate(Total = `No` + `Yes`) %>%
    mutate(`No` = round(`No` / Total, 3)) %>%
    mutate(`Yes` = round(`Yes` / Total, 3)) %>%
    select(-Total) %>%
    rename(`Bone Medications at Follow-up` = bonemed_fu) %>%
    kable("html", caption = "Fracture by Bone Medications at Follow-up") %>%
    kable_styling("striped", full_width = F)

# Fisher's exact test for difference
fisher.test(bone$fracture, bone$bonemed_fu)
 No Yes 
361 139 
`summarise()` has grouped output by 'bonemed_fu'. You can override using the
`.groups` argument.
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Fracture by Bone Medications at Follow-up</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Bone Medications at Follow-up </th>
   <th style="text-align:right;"> No </th>
   <th style="text-align:right;"> Yes </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> No </td>
   <td style="text-align:right;"> 0.801 </td>
   <td style="text-align:right;"> 0.199 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 0.619 </td>
   <td style="text-align:right;"> 0.381 </td>
  </tr>
</tbody>
</table>
	Fisher's Exact Test for Count Data

data:  bone$fracture and bone$bonemed_fu
p-value = 4.787e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.570473 3.877015
sample estimates:
odds ratio 
   2.46857 
[Plot: Fracture by Bone Medications at Follow-up]

Bone medication use at follow-up appears to be strongly associated with fracture. The odds of fracture are about 2.5 times greater for those taking bone medications, and the difference is statistically significant (p < 0.0001 from Fisher's exact test). We expect this to be an even stronger predictor than bone medications at enrollment.

Bone Medications at both Enrollment and Follow-up:

In [23]:
summary(factor(bone$bonetreat))

# Fracture by bonetreat
bone %>%
    ggplot(aes(x = bonetreat, fill = fracture)) +
    geom_bar() +
    theme_gdocs() +
    labs(
        title = "Fracture by Bone Medications at both Enrollment and Follow-up",
        x = "Bone Medications at both Enrollment and Follow-up",
        y = "Count",
        fill = "Fracture"
    )

# Table of fracture prevalence by bonetreat
bone %>%
    group_by(bonetreat, fracture) %>%
    summarise(Count = n()) %>%
    pivot_wider(names_from = fracture, values_from = Count) %>%
    mutate(Total = `No` + `Yes`) %>%
    mutate(`No` = round(`No` / Total, 3)) %>%
    mutate(`Yes` = round(`Yes` / Total, 3)) %>%
    select(-Total) %>%
    rename(`Bone Medications at both Enrollment and Follow-up` = bonetreat) %>%
    kable("html", caption = "Fracture by Bone Medications at both Enrollment and Follow-up") %>%
    kable_styling("striped", full_width = F)

# Fisher's exact test for difference
fisher.test(bone$fracture, bone$bonetreat)
 No Yes 
382 118 
`summarise()` has grouped output by 'bonetreat'. You can override using the
`.groups` argument.
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Fracture by Bone Medications at both Enrollment and Follow-up</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Bone Medications at both Enrollment and Follow-up </th>
   <th style="text-align:right;"> No </th>
   <th style="text-align:right;"> Yes </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> No </td>
   <td style="text-align:right;"> 0.777 </td>
   <td style="text-align:right;"> 0.223 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 0.661 </td>
   <td style="text-align:right;"> 0.339 </td>
  </tr>
</tbody>
</table>
	Fisher's Exact Test for Count Data

data:  bone$fracture and bone$bonetreat
p-value = 0.01466
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.106815 2.871974
sample estimates:
odds ratio 
  1.789657 
[Plot: Fracture by Bone Medications at both Enrollment and Follow-up]

Bone medication at both enrollment and follow-up is associated with a higher prevalence of fracture. The odds of fracture are about 1.79 times greater for those taking bone medications at both time points, and the difference is statistically significant (p = 0.015 from Fisher's exact test).

Visualize scatterplot matrices by fracture between numeric variables:

In [24]:
library(GGally)
# Scatterplot matrix by fracture
bone %>%
    select(c(5, 6, 7, 8, 13, 14, 15)) %>%
    ggpairs(columns = 1:ncol(.),
        mapping = aes(color = fracture),
        lower = list(continuous = wrap("points", alpha = 0.3))) +
    theme_gdocs() +
    labs(title = "Scatterplot Matrix by Fracture", color = "Fracture")
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Plot: Scatterplot Matrix by Fracture]

There is some multicollinearity among the numeric variables. The previously discussed relationships with fracture rate hold, but there is no obvious separation between the fracture groups. The composite fracture score is most strongly correlated with age (r = 0.87). We will keep this variable unless it causes problems.

Correlations between variables:

In [25]:
mydata <- bone

# Convert the categorical variables to factors
mydata[sapply(mydata, is.character)] <- lapply(mydata[sapply(mydata, is.character)], as.factor)

# Convert factors to numeric
mydata[sapply(mydata, is.factor)] <- lapply(mydata[sapply(mydata, is.factor)], as.numeric)

# Calculate the correlation matrix
correlation_matrix <- cor(mydata)

library(corrplot)

palette <- colorRampPalette(c("green", "white", "red"))(20)
heatmap(x = correlation_matrix, col = palette, symm = TRUE)
corrplot 0.92 loaded

[Plot: Correlation heatmap of all variables]

There is some multicollinearity between related variables, for example between weight and BMI, and among the subject, site, and physician IDs. The bone medication variables are also highly correlated with one another, and the composite fracture score is correlated with age, prior fracture, and arm assist. These were all predicted to be important variables, and the correlation may cause variance inflation.

Objective 1¶

The first objective is to build a logistic regression for interpretation. Although the goal of this section is interpretability, we still split the dataset into training and validation sets to evaluate the model. These initial results will serve as a baseline against which to compare the performance of the more complex models.

In [26]:
# Split into train / test using caret
set.seed(123)
trainIndex <- createDataPartition(bone$fracture, p = .7, list = FALSE)
train <- bone[trainIndex, ]
test <- bone[-trainIndex, ]
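caret's `createDataPartition` stratifies on the outcome, so both splits preserve the class proportions. A base-R sketch of the same idea, using a synthetic `y` whose 25% "Yes" rate roughly mimics the fracture prevalence:

```r
# Base-R sketch of a stratified 70/30 split (the idea behind
# caret::createDataPartition). `y` is synthetic illustration data.
set.seed(123)
y <- factor(rep(c("No", "Yes"), c(375, 125)))
idx <- unlist(lapply(split(seq_along(y), y),
                     function(i) sample(i, floor(0.7 * length(i)))))
round(prop.table(table(y[idx])), 2)    # class balance preserved in training
round(prop.table(table(y[-idx])), 2)   # ...and in the held-out set
```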

Logistic Regression Model for Interpretable Results¶

For the interpretable model, we will fit a logistic regression model with all of the variables. We will perform feature selection to remove insignificant variables, noting whether the significance levels correspond with the EDA.

In [27]:
# Logistic regression model all predictors
bone_logit <- glm(fracture ~ ., data = train, family = "binomial")
summary(bone_logit)

library(car)
vif(bone_logit)
Warning message:
“glm.fit: algorithm did not converge”
Call:
glm(formula = fracture ~ ., family = "binomial", data = train)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)     -2.657e+01  2.346e+06   0.000    1.000
sub_id           3.569e-16  2.045e+02   0.000    1.000
site_id          7.793e-13  4.668e+04   0.000    1.000
phy_id          -3.624e-15  7.157e+02   0.000    1.000
priorfracYes    -3.703e-12  7.552e+04   0.000    1.000
age              8.034e-15  1.014e+04   0.000    1.000
weight           2.550e-13  1.671e+04   0.000    1.000
height          -1.951e-13  1.478e+04   0.000    1.000
bmi             -1.224e-12  4.456e+04   0.000    1.000
premenoYes       3.792e-12  5.086e+04   0.000    1.000
momfracYes      -2.193e-12  7.530e+04   0.000    1.000
armassistYes     1.666e-12  1.088e+05   0.000    1.000
smokeYes        -2.286e-13  9.863e+04   0.000    1.000
rateriskSame    -6.434e-13  4.694e+04   0.000    1.000
rateriskGreater -3.518e-13  5.507e+04   0.000    1.000
fracscore        1.424e-13  4.998e+04   0.000    1.000
bonemedYes       5.070e-11  1.252e+05   0.000    1.000
bonemed_fuYes   -4.743e-13  1.091e+05   0.000    1.000
bonetreatYes    -5.465e-11  1.698e+05   0.000    1.000
fracture.num     5.313e+01  7.065e+04   0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.9531e+02  on 350  degrees of freedom
Residual deviance: 2.0364e-09  on 331  degrees of freedom
AIC: 40

Number of Fisher Scoring iterations: 25
Loading required package: carData


Attaching package: ‘car’


The following object is masked from ‘package:dplyr’:

    recode


The following object is masked from ‘package:purrr’:

    some


A matrix: 18 × 3 of type dbl
                    GVIF Df GVIF^(1/(2*Df))
sub_id          2.385293  1        1.544439
site_id        20.293870  1        4.504872
phy_id         20.264439  1        4.501604
priorfrac       2.702715  1        1.643994
age            22.870082  1        4.782267
weight        201.026472  1       14.178380
height         24.115846  1        4.910789
bmi           177.912345  1       13.338379
premeno         1.105563  1        1.051458
momfrac         2.042801  1        1.429266
armassist       7.635698  1        2.763277
smoke           1.446703  1        1.202790
raterisk        1.343432  2        1.076599
fracscore      43.439649  1        6.590876
bonemed         8.082111  1        2.842905
bonemed_fu      6.281024  1        2.506197
bonetreat      13.783893  1        3.712667
fracture.num    2.595113  1        1.610935

Including all of the predictors results in a model that does not converge. The warning and the near-zero residual deviance point to perfect separation: fracture.num is a numeric copy of the response, so it predicts fracture exactly (note its coefficient is the only one that is not essentially zero). The administrative identifiers sub_id, site_id, and phy_id carry no predictive meaning, and several of the variance inflation factors are very high.
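A small sketch of this failure mode: when a numeric copy of the response (like fracture.num) sits among the predictors, the fit is perfectly separated, so the coefficient diverges and the residual deviance collapses toward zero, just as in the summary above. The data here are synthetic.

```r
# Synthetic illustration of perfect separation: the response leaks into
# the design matrix via a numeric copy, so IWLS chases an infinite MLE.
set.seed(2)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
ynum <- y                                  # numeric copy of the response
fit <- suppressWarnings(glm(y ~ x + ynum, family = binomial))
c(converged = fit$converged, deviance = deviance(fit))
```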

Next, we fit a logistic regression with the offending predictors removed:

In [28]:
# Logistic regression model useful predictors
bone_logit <- glm(fracture ~ . -fracture.num -sub_id -site_id -phy_id, data = train, family = "binomial")
summary(bone_logit)

library(car)
#vif(bone_logit)
Call:
glm(formula = fracture ~ . - fracture.num - sub_id - site_id - 
    phy_id, family = "binomial", data = train)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)   
(Intercept)     -35.99677   16.16707  -2.227  0.02598 * 
priorfracYes      0.78136    0.50589   1.545  0.12246   
age               0.03050    0.07058   0.432  0.66570   
weight           -0.23595    0.11803  -1.999  0.04560 * 
height            0.19060    0.10240   1.861  0.06269 . 
bmi               0.64398    0.31160   2.067  0.03877 * 
premenoYes        0.16339    0.35704   0.458  0.64723   
momfracYes        0.64632    0.50215   1.287  0.19806   
armassistYes      0.02265    0.76122   0.030  0.97626   
smokeYes         -0.28240    0.72646  -0.389  0.69747   
rateriskSame      0.50019    0.35366   1.414  0.15727   
rateriskGreater   0.73777    0.39191   1.882  0.05977 . 
fracscore         0.06095    0.35140   0.173  0.86230   
bonemedYes        2.00515    0.77236   2.596  0.00943 **
bonemed_fuYes     1.46647    0.65483   2.239  0.02512 * 
bonetreatYes     -3.10117    1.03350  -3.001  0.00269 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 395.31  on 350  degrees of freedom
Residual deviance: 336.20  on 335  degrees of freedom
AIC: 368.2

Number of Fisher Scoring iterations: 4

This model appears to be more stable. As expected, bmi and fracscore still cause variance inflation through multicollinearity: bmi is a function of weight and height, and fracscore is highly correlated with age. We will remove these variables and refit the model:
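A hedged sketch of where the large VIF for bmi comes from: a predictor's VIF is 1 / (1 − R²) from regressing it on the remaining predictors (what `car::vif` reports when Df = 1). The data below are synthetic; since bmi is a deterministic function of weight and height, its R² is near 1 and its VIF explodes.

```r
# Synthetic illustration: VIF computed by hand as 1 / (1 - R^2).
set.seed(3)
n  <- 200
wt <- rnorm(n, 70, 10)            # weight in kg
ht <- rnorm(n, 160, 6)            # height in cm
bmi <- wt / (ht / 100)^2          # bmi is (almost linearly) determined by both
r2 <- summary(lm(bmi ~ wt + ht))$r.squared
1 / (1 - r2)                      # very large VIF, as seen for bmi above
```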

In [29]:
# Logistic regression model with useful predictors
bone_logit <- glm(fracture ~ . - fracture.num - sub_id - site_id - phy_id - bmi - fracscore,
    data = train, family = "binomial"
)
summary(bone_logit)

library(car)
#vif(bone_logit)
Call:
glm(formula = fracture ~ . - fracture.num - sub_id - site_id - 
    phy_id - bmi - fracscore, family = "binomial", data = train)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)   
(Intercept)     -2.543797   4.119675  -0.617  0.53692   
priorfracYes     0.827962   0.319981   2.588  0.00967 **
age              0.041162   0.018119   2.272  0.02310 * 
weight           0.007568   0.010470   0.723  0.46980   
height          -0.018499   0.024268  -0.762  0.44589   
premenoYes       0.174294   0.355327   0.491  0.62377   
momfracYes       0.719337   0.351618   2.046  0.04078 * 
armassistYes     0.124069   0.319554   0.388  0.69783   
smokeYes        -0.278345   0.643426  -0.433  0.66531   
rateriskSame     0.460120   0.352219   1.306  0.19144   
rateriskGreater  0.706537   0.389731   1.813  0.06985 . 
bonemedYes       1.974352   0.765409   2.579  0.00990 **
bonemed_fuYes    1.421071   0.650332   2.185  0.02888 * 
bonetreatYes    -3.039918   1.023637  -2.970  0.00298 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 395.31  on 350  degrees of freedom
Residual deviance: 340.69  on 337  degrees of freedom
AIC: 368.69

Number of Fisher Scoring iterations: 4

This model appears to be stable. We will use this model for interpretation.
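The overall fit can also be checked with a likelihood-ratio test computed directly from the null and residual deviances reported in the summary above (395.31 on 350 df vs. 340.69 on 337 df):

```r
# Overall model chi-square test from the deviances in the summary above.
g  <- 395.31 - 340.69               # deviance explained by the 13 predictors
df <- 350 - 337
pchisq(g, df, lower.tail = FALSE)   # small p-value: the model beats the null
```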

In [30]:
anova(bone_logit, test = "Chisq") %>% kable() %>% kable_styling("striped", full_width = F)
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Df </th>
   <th style="text-align:right;"> Deviance </th>
   <th style="text-align:right;"> Resid. Df </th>
   <th style="text-align:right;"> Resid. Dev </th>
   <th style="text-align:right;"> Pr(&gt;Chi) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> NULL </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> 350 </td>
   <td style="text-align:right;"> 395.3076 </td>
   <td style="text-align:right;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:left;"> priorfrac </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 22.6117687 </td>
   <td style="text-align:right;"> 349 </td>
   <td style="text-align:right;"> 372.6959 </td>
   <td style="text-align:right;"> 0.0000020 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> age </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 8.4271304 </td>
   <td style="text-align:right;"> 348 </td>
   <td style="text-align:right;"> 364.2687 </td>
   <td style="text-align:right;"> 0.0036966 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> weight </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.2145289 </td>
   <td style="text-align:right;"> 347 </td>
   <td style="text-align:right;"> 364.0542 </td>
   <td style="text-align:right;"> 0.6432406 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> height </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.7752355 </td>
   <td style="text-align:right;"> 346 </td>
   <td style="text-align:right;"> 363.2790 </td>
   <td style="text-align:right;"> 0.3786023 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> premeno </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.1828968 </td>
   <td style="text-align:right;"> 345 </td>
   <td style="text-align:right;"> 363.0961 </td>
   <td style="text-align:right;"> 0.6688955 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> momfrac </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 4.5683554 </td>
   <td style="text-align:right;"> 344 </td>
   <td style="text-align:right;"> 358.5277 </td>
   <td style="text-align:right;"> 0.0325678 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> armassist </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1.0993530 </td>
   <td style="text-align:right;"> 343 </td>
   <td style="text-align:right;"> 357.4284 </td>
   <td style="text-align:right;"> 0.2944081 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> smoke </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.2535692 </td>
   <td style="text-align:right;"> 342 </td>
   <td style="text-align:right;"> 357.1748 </td>
   <td style="text-align:right;"> 0.6145730 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> raterisk </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 5.1245459 </td>
   <td style="text-align:right;"> 340 </td>
   <td style="text-align:right;"> 352.0502 </td>
   <td style="text-align:right;"> 0.0771292 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bonemed </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2.0363474 </td>
   <td style="text-align:right;"> 339 </td>
   <td style="text-align:right;"> 350.0139 </td>
   <td style="text-align:right;"> 0.1535780 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bonemed_fu </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.1099135 </td>
   <td style="text-align:right;"> 338 </td>
   <td style="text-align:right;"> 349.9040 </td>
   <td style="text-align:right;"> 0.7402427 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bonetreat </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 9.2133604 </td>
   <td style="text-align:right;"> 337 </td>
   <td style="text-align:right;"> 340.6906 </td>
   <td style="text-align:right;"> 0.0024026 </td>
  </tr>
</tbody>
</table>

The following variables are statistically significant at the 0.05 level: priorfrac, age, momfrac, and bonetreat. The variable raterisk is significant at the 0.10 level. These results are consistent with the EDA, except that armassist is no longer significant. We will keep all of the variables in the model for interpretation.

Interpretation¶

In [31]:
# Odds ratios
exp(cbind(OR = coef(bone_logit), confint(bone_logit))) %>%
    kable("html", caption = "Odds Ratios", digits = 3) %>%
    kable_styling("striped", full_width = F)
Waiting for profiling to be done...

<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Odds Ratios</caption>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> OR </th>
   <th style="text-align:right;"> 2.5 % </th>
   <th style="text-align:right;"> 97.5 % </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 0.079 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 274.693 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> priorfracYes </td>
   <td style="text-align:right;"> 2.289 </td>
   <td style="text-align:right;"> 1.219 </td>
   <td style="text-align:right;"> 4.287 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> age </td>
   <td style="text-align:right;"> 1.042 </td>
   <td style="text-align:right;"> 1.006 </td>
   <td style="text-align:right;"> 1.080 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> weight </td>
   <td style="text-align:right;"> 1.008 </td>
   <td style="text-align:right;"> 0.987 </td>
   <td style="text-align:right;"> 1.029 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> height </td>
   <td style="text-align:right;"> 0.982 </td>
   <td style="text-align:right;"> 0.935 </td>
   <td style="text-align:right;"> 1.029 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> premenoYes </td>
   <td style="text-align:right;"> 1.190 </td>
   <td style="text-align:right;"> 0.582 </td>
   <td style="text-align:right;"> 2.359 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> momfracYes </td>
   <td style="text-align:right;"> 2.053 </td>
   <td style="text-align:right;"> 1.020 </td>
   <td style="text-align:right;"> 4.072 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> armassistYes </td>
   <td style="text-align:right;"> 1.132 </td>
   <td style="text-align:right;"> 0.601 </td>
   <td style="text-align:right;"> 2.111 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> smokeYes </td>
   <td style="text-align:right;"> 0.757 </td>
   <td style="text-align:right;"> 0.187 </td>
   <td style="text-align:right;"> 2.454 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> rateriskSame </td>
   <td style="text-align:right;"> 1.584 </td>
   <td style="text-align:right;"> 0.801 </td>
   <td style="text-align:right;"> 3.207 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> rateriskGreater </td>
   <td style="text-align:right;"> 2.027 </td>
   <td style="text-align:right;"> 0.949 </td>
   <td style="text-align:right;"> 4.400 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bonemedYes </td>
   <td style="text-align:right;"> 7.202 </td>
   <td style="text-align:right;"> 1.680 </td>
   <td style="text-align:right;"> 37.333 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bonemed_fuYes </td>
   <td style="text-align:right;"> 4.142 </td>
   <td style="text-align:right;"> 1.162 </td>
   <td style="text-align:right;"> 15.636 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bonetreatYes </td>
   <td style="text-align:right;"> 0.048 </td>
   <td style="text-align:right;"> 0.006 </td>
   <td style="text-align:right;"> 0.341 </td>
  </tr>
</tbody>
</table>

Examples of interpretation:

  • Holding all other variables constant, the odds of fracture increase by about 4% for each additional year of age. We can be 95% confident that the true per-year increase in the odds is between 0.6% and 8%. This effect is both statistically (p < 0.05) and practically significant: a 5-year increase in age corresponds to roughly a 23% increase in the odds of fracture.
  • Holding all other variables constant, the odds of fracture for patients taking bone medications at follow-up are 4.1 times the odds for those who are not. We can be 95% confident that the true odds ratio is between 1.2 and 15.6. This effect is also statistically (p < 0.05) and practically significant.
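The arithmetic behind the first bullet follows directly from the age coefficient reported in the model summary (0.041162). Note that a k-year change multiplies the odds by exp(k·β), not by k·exp(β):

```r
b_age  <- 0.041162          # age coefficient from the summary above
or_1yr <- exp(b_age)        # ~1.042: about a 4% increase per year
or_5yr <- exp(5 * b_age)    # ~1.228: about a 23% increase over 5 years
round(c(per_year = or_1yr, per_5_years = or_5yr), 3)
```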

The following effects plots show how changes in predictors affect the probability of fracture:

In [32]:
# New Way
library(effects)
plot(effect("age", bone_logit))
plot(effect("raterisk", bone_logit))
plot(effect("priorfrac", bone_logit))
plot(effect("momfrac", bone_logit))
plot(effect("armassist", bone_logit))
plot(effect("bonemed", bone_logit))
plot(effect("bonemed_fu", bone_logit))
plot(effect("bonetreat", bone_logit))


# Previous way
library(sjPlot)
library(sjmisc)
plot_model(bone_logit,type="pred",terms=c("age","raterisk"))
plot_model(bone_logit,type="pred",terms=c("bonemed","bonemed_fu","bonetreat"))
Use the command
    lattice::trellis.par.set(effectsTheme())
  to customize lattice options for effects plots.
See ?effectsTheme for details.

(effects plot images not preserved in this export)


Attaching package: ‘sjmisc’


The following object is masked from ‘package:purrr’:

    is_empty


The following object is masked from ‘package:tidyr’:

    replace_na


The following object is masked from ‘package:tibble’:

    add_case


Data were 'prettified'. Consider using `terms="age [all]"` to get smooth
  plots.

(prediction plot images not preserved in this export)

Error Metrics¶

The error metrics below pertain to how the logistic regression model performs on the test data.

In [33]:
library(pROC)
# Predicting probabilities on the test data
logit.probs <- predict(bone_logit, test, type = "response")
logit.roc <- roc(response = test$fracture, predictor = logit.probs, levels = c("No", "Yes"))


plot(logit.roc,print.thres="best",col="red")
# plot(glmnet.roc,add=T,col="lightblue")
legend("bottomright",
       legend=c("Simple", "Complex","GLMNET"),
       col=c("black", "red","lightblue"),
       lwd=4, cex =1, xpd = TRUE, horiz = FALSE)
auc(logit.roc)

# Confusion Matrix
logit.probs <- predict(bone_logit, test, type = "response")
logit.pred <- ifelse(logit.probs > 0.320, "Yes", "No")
logit.conf <- confusionMatrix(factor(logit.pred), factor(test$fracture), positive = "Yes")
logit.conf
Type 'citation("pROC")' for a citation.


Attaching package: ‘pROC’


The following objects are masked from ‘package:stats’:

    cov, smooth, var


Setting direction: controls < cases

0.693532818532818
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  89  17
       Yes 23  20
                                          
               Accuracy : 0.7315          
                 95% CI : (0.6529, 0.8008)
    No Information Rate : 0.7517          
    P-Value [Acc > NIR] : 0.7493          
                                          
                  Kappa : 0.3179          
                                          
 Mcnemar's Test P-Value : 0.4292          
                                          
            Sensitivity : 0.5405          
            Specificity : 0.7946          
         Pos Pred Value : 0.4651          
         Neg Pred Value : 0.8396          
             Prevalence : 0.2483          
         Detection Rate : 0.1342          
   Detection Prevalence : 0.2886          
      Balanced Accuracy : 0.6676          
                                          
       'Positive' Class : Yes             
                                          
(ROC curve plot not preserved in this export)

The logistic regression model has an AUC of 0.69. With the classification threshold set at the ROC-optimal level of 0.32, the confusion matrix shows a sensitivity of 0.54 and a specificity of 0.79: the model is better at predicting non-fractures than fractures. We expect the more complicated models that follow to perform better at prediction.
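For reference, the reported sensitivity and specificity, along with Youden's J (the quantity the "best" ROC threshold maximizes), can be recomputed directly from the confusion matrix cells above:

```r
# Cells from the confusion matrix above: 20 true positives, 17 false
# negatives, 23 false positives, 89 true negatives.
tp <- 20; fn <- 17; fp <- 23; tn <- 89
sens <- tp / (tp + fn)       # 0.5405
spec <- tn / (tn + fp)       # 0.7946
youden_j <- sens + spec - 1  # maximized by the "best" threshold on the ROC
round(c(sensitivity = sens, specificity = spec, youden_J = youden_j), 3)
```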

Objective 2¶

For the second objective, we attempt to find a model with higher predictive performance, without concern for interpretability. To accomplish this, we will introduce complexity into the logistic regression model by adding polynomial and interaction terms and then perform feature selection using the glmnet package. We will also fit LDA and QDA models and a nonparametric model and compare the results.

The metric we will use to compare the models is the area under the receiver operating characteristic curve (AUROC). Because this is an observational study and we have no specific performance requirements, there is no obvious cutoff for sensitivity or specificity. AUROC summarizes the model's overall predictive power across all thresholds, combining sensitivity and specificity, and will therefore be used to select the best predictive model.
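One way to see why AUROC is threshold-free: it equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case (ties counting one half). A tiny illustration with made-up scores:

```r
# Made-up predicted probabilities (not from the study data)
pos <- c(0.9, 0.6, 0.4)   # cases with a fracture
neg <- c(0.8, 0.3, 0.2)   # cases without
grid <- expand.grid(p = pos, n = neg)   # all 9 positive/negative pairs
auc <- mean((grid$p > grid$n) + 0.5 * (grid$p == grid$n))
auc   # 7 of the 9 pairs are ordered correctly: 7/9 ~ 0.778
```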

Complex Logistic Regression Model¶

To capture any complexity that the interpretable logistic regression may have missed, we will add two types of terms. The numeric variables will be expanded into polynomial terms up to degree three, and then every pair of predictors will be interacted. There is some evidence of interaction in the data, such as this example of age and raterisk:

In [34]:
#Visualize mean fracture rate by age and raterisk
ggplot(train, aes(x = age, y = fracture.num, color = raterisk)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "Age", y = "Fracture Rate", color = "Risk") +
    theme_gdocs()
`geom_smooth()` using formula = 'y ~ x'
(age by raterisk interaction plot not preserved in this export)

To account for this, we will simply add all possible interaction terms and then perform feature selection using glmnet.
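Before running the cell below, it is worth counting what this expansion produces. With 14 predictors (5 of them numeric), the formula gains one degree-3 polynomial term per numeric column plus every pairwise interaction; after factor levels are expanded, these terms become the 134-row coefficient vector reported by glmnet.

```r
p_numeric <- 5                   # age, weight, height, bmi, fracscore
p_total   <- 14                  # predictors after dropping the id columns
n_poly    <- p_numeric           # one poly(x, 3) term per numeric column
n_inter   <- choose(p_total, 2)  # all pairwise interactions: 91
n_poly + n_inter                 # 96 terms on the right-hand side
```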

In [35]:
# Create formula

# Identify numeric columns
numeric_cols <- names(bone)[sapply(bone, is.numeric)] %>%
    setdiff(c("fracture", "fracture.num", "sub_id", "site_id", "phy_id"))

# Generate polynomial terms
poly_terms <- sapply(numeric_cols, function(x) paste0("poly(", x, ", 3)"))

# Generate interaction terms
all_terms <- names(bone) %>%
    setdiff(c("fracture", "fracture.num", "sub_id", "site_id", "phy_id"))
interaction_terms <- combn(all_terms, 2, function(x) paste(x, collapse = ":"), simplify = TRUE)

# Combine everything for the formula
formula_terms <- c(poly_terms, interaction_terms)
formula_string <- paste("fracture ~", paste(formula_terms, collapse = " + "))

# Convert to a formula object
final_formula <- as.formula(formula_string)

print(final_formula)

# Complex Logistic regression model
library(caret)
fitControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 1,
  classProbs = TRUE,
  summaryFunction = multiClassSummary
)
set.seed(1234)
# GLMNET
complex.fit <- train(final_formula,
    data = train,
    method = "glmnet",
    trControl = fitControl,
    metric = "AUC"
)
print(complex.fit)
coef(complex.fit$finalModel, complex.fit$finalModel$lambdaOpt)

#Plot lambda Opt
plot(complex.fit$finalModel, xvar="lambda", label=TRUE)
abline(v=log(complex.fit$finalModel$lambdaOpt), col="red", lwd=2, lty=2)
fracture ~ poly(age, 3) + poly(weight, 3) + poly(height, 3) + 
    poly(bmi, 3) + poly(fracscore, 3) + priorfrac:age + priorfrac:weight + 
    priorfrac:height + priorfrac:bmi + priorfrac:premeno + priorfrac:momfrac + 
    priorfrac:armassist + priorfrac:smoke + priorfrac:raterisk + 
    priorfrac:fracscore + priorfrac:bonemed + priorfrac:bonemed_fu + 
    priorfrac:bonetreat + age:weight + age:height + age:bmi + 
    age:premeno + age:momfrac + age:armassist + age:smoke + age:raterisk + 
    age:fracscore + age:bonemed + age:bonemed_fu + age:bonetreat + 
    weight:height + weight:bmi + weight:premeno + weight:momfrac + 
    weight:armassist + weight:smoke + weight:raterisk + weight:fracscore + 
    weight:bonemed + weight:bonemed_fu + weight:bonetreat + height:bmi + 
    height:premeno + height:momfrac + height:armassist + height:smoke + 
    height:raterisk + height:fracscore + height:bonemed + height:bonemed_fu + 
    height:bonetreat + bmi:premeno + bmi:momfrac + bmi:armassist + 
    bmi:smoke + bmi:raterisk + bmi:fracscore + bmi:bonemed + 
    bmi:bonemed_fu + bmi:bonetreat + premeno:momfrac + premeno:armassist + 
    premeno:smoke + premeno:raterisk + premeno:fracscore + premeno:bonemed + 
    premeno:bonemed_fu + premeno:bonetreat + momfrac:armassist + 
    momfrac:smoke + momfrac:raterisk + momfrac:fracscore + momfrac:bonemed + 
    momfrac:bonemed_fu + momfrac:bonetreat + armassist:smoke + 
    armassist:raterisk + armassist:fracscore + armassist:bonemed + 
    armassist:bonemed_fu + armassist:bonetreat + smoke:raterisk + 
    smoke:fracscore + smoke:bonemed + smoke:bonemed_fu + smoke:bonetreat + 
    raterisk:fracscore + raterisk:bonemed + raterisk:bonemed_fu + 
    raterisk:bonetreat + fracscore:bonemed + fracscore:bonemed_fu + 
    fracscore:bonetreat + bonemed:bonemed_fu + bonemed:bonetreat + 
    bonemed_fu:bonetreat
Warning message:
“from glmnet C++ code (error code -91); Convergence for 91th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned”
Warning message:
“from glmnet C++ code (error code -86); Convergence for 86th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned”
glmnet 

351 samples
 14 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times) 
Summary of sample sizes: 280, 281, 281, 281, 281 
Resampling results across tuning parameters:

  alpha  lambda        logLoss    AUC        prAUC      Accuracy   Kappa     
  0.10   0.0002517636  0.9578117  0.6487398  0.6083914  0.7181087  0.15806425
  0.10   0.0025176356  0.6721001  0.6665398  0.6306358  0.7350503  0.20541926
  0.10   0.0251763560  0.5540170  0.6896566  0.6346748  0.7377062  0.19229555
  0.55   0.0002517636  0.9908692  0.6448682  0.6083189  0.7209658  0.16290411
  0.55   0.0025176356  0.6472304  0.6635144  0.6287970  0.7350101  0.18114035
  0.55   0.0251763560  0.5526769  0.6644619  0.6072282  0.7406439  0.14203053
  1.00   0.0002517636  1.1544814  0.6277218  0.5940189  0.7123944  0.16016255
  1.00   0.0025176356  0.6306857  0.6601725  0.6262662  0.7463984  0.20872680
  1.00   0.0251763560  0.5405747  0.6719953  0.6108930  0.7407646  0.09397295
  F1         Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value
  0.8207916  0.8672714    0.2758170    0.7818182       0.4440171     
  0.8321376  0.8824383    0.2973856    0.7894778       0.5291667     
  0.8356269  0.8975327    0.2627451    0.7836289       0.5120588     
  0.8229669  0.8710450    0.2758170    0.7825348       0.4498779     
  0.8341874  0.8938316    0.2620915    0.7839173       0.5111538     
  0.8423148  0.9240929    0.1947712    0.7747653       0.4633838     
  0.8156374  0.8521045    0.2980392    0.7847838       0.4256022     
  0.8420317  0.9052250    0.2732026    0.7886394       0.5466667     
  0.8455621  0.9468795    0.1261438    0.7641436       0.4395238     
  Precision  Recall     Detection_Rate  Balanced_Accuracy
  0.7818182  0.8672714  0.6497787       0.5715442        
  0.7894778  0.8824383  0.6610463       0.5899120        
  0.7836289  0.8975327  0.6722736       0.5801389        
  0.7825348  0.8710450  0.6526358       0.5734310        
  0.7839173  0.8938316  0.6696177       0.5779616        
  0.7747653  0.9240929  0.6922736       0.5594321        
  0.7847838  0.8521045  0.6383501       0.5750719        
  0.7886394  0.9052250  0.6781489       0.5892138        
  0.7641436  0.9468795  0.7094165       0.5365117        

AUC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.1 and lambda = 0.02517636.
134 x 1 sparse Matrix of class "dgCMatrix"
                                         s1
(Intercept)                   -1.694411e+00
poly(age, 3)1                  .           
poly(age, 3)2                  1.135144e+00
poly(age, 3)3                  2.949949e+00
poly(weight, 3)1               .           
poly(weight, 3)2              -6.158068e+00
poly(weight, 3)3               3.206821e+00
poly(height, 3)1              -3.361492e+00
poly(height, 3)2               .           
poly(height, 3)3              -2.035043e+00
poly(bmi, 3)1                  3.751888e-02
poly(bmi, 3)2                  8.112392e-01
poly(bmi, 3)3                 -2.854427e+00
poly(fracscore, 3)1            1.439617e+00
poly(fracscore, 3)2           -2.244877e+00
poly(fracscore, 3)3           -3.243295e+00
priorfracNo:age                .           
priorfracYes:age               .           
priorfracNo:weight            -4.309180e-03
priorfracYes:weight            3.656712e-03
priorfracNo:height            -3.331819e-04
priorfracYes:height            .           
priorfracNo:bmi               -1.112839e-02
priorfracYes:bmi               1.690953e-02
priorfracNo:premenoYes         4.733499e-01
priorfracYes:premenoYes       -1.615695e+00
priorfracNo:momfracYes         3.467449e-01
priorfracYes:momfracYes       -2.262450e-01
priorfracNo:armassistYes       .           
priorfracYes:armassistYes      4.107876e-01
priorfracNo:smokeYes          -4.381342e-02
priorfracYes:smokeYes         -4.828784e-01
priorfracNo:rateriskSame       .           
priorfracYes:rateriskSame     -5.654224e-02
priorfracNo:rateriskGreater    .           
priorfracYes:rateriskGreater   3.450225e-01
priorfracNo:fracscore          8.089374e-02
priorfracYes:fracscore         .           
priorfracNo:bonemedYes         .           
priorfracYes:bonemedYes        4.178156e-02
priorfracNo:bonemed_fuYes      9.585859e-02
priorfracYes:bonemed_fuYes    -1.491198e-04
priorfracNo:bonetreatYes       .           
priorfracYes:bonetreatYes     -6.554364e-01
age:weight                     .           
age:height                     .           
age:bmi                        3.700088e-06
age:premenoYes                 .           
age:momfracYes                 4.755676e-03
age:armassistYes               .           
age:smokeYes                  -6.612055e-03
age:rateriskSame              -1.374109e-03
age:rateriskGreater            .           
age:fracscore                  1.924300e-04
age:bonemedYes                 3.081204e-03
age:bonemed_fuYes              .           
age:bonetreatYes              -2.863916e-03
weight:height                 -1.930414e-05
weight:bmi                     .           
weight:premenoYes             -4.437456e-04
weight:momfracYes              1.035816e-03
weight:armassistYes            .           
weight:smokeYes                .           
weight:rateriskSame            .           
weight:rateriskGreater         2.393714e-03
weight:fracscore               .           
weight:bonemedYes              2.806503e-03
weight:bonemed_fuYes           1.115636e-04
weight:bonetreatYes            .           
height:bmi                     .           
height:premenoYes              .           
height:momfracYes              6.750728e-05
height:armassistYes            .           
height:smokeYes               -1.760790e-03
height:rateriskSame            .           
height:rateriskGreater         4.097752e-04
height:fracscore               2.041456e-04
height:bonemedYes              .           
height:bonemed_fuYes           .           
height:bonetreatYes           -1.813039e-03
bmi:premenoYes                 .           
bmi:momfracYes                 5.822825e-03
bmi:armassistYes               .           
bmi:smokeYes                   .           
bmi:rateriskSame               .           
bmi:rateriskGreater            9.119236e-03
bmi:fracscore                  7.343444e-04
bmi:bonemedYes                 1.558835e-02
bmi:bonemed_fuYes              1.535395e-02
bmi:bonetreatYes               6.758366e-05
premenoYes:momfracYes         -7.064196e-01
premenoYes:armassistYes        1.845991e-01
premenoYes:smokeYes            .           
premenoYes:rateriskSame        3.315938e-01
premenoYes:rateriskGreater     7.093319e-01
premenoYes:fracscore          -1.187545e-02
premenoYes:bonemedYes         -3.055726e-01
premenoYes:bonemed_fuYes      -8.276016e-02
premenoYes:bonetreatYes       -3.076073e-01
momfracYes:armassistYes       -9.743656e-01
momfracYes:smokeYes           -9.916221e-01
momfracYes:rateriskSame        7.675159e-01
momfracYes:rateriskGreater     .           
momfracYes:fracscore           9.861347e-02
momfracYes:bonemedYes         -7.972900e-01
momfracYes:bonemed_fuYes       .           
momfracYes:bonetreatYes       -7.985847e-01
armassistYes:smokeYes          6.749255e-02
armassistYes:rateriskSame      2.946910e-01
armassistYes:rateriskGreater   5.782252e-02
armassistYes:fracscore        -5.650234e-03
armassistYes:bonemedYes        .           
armassistYes:bonemed_fuYes     .           
armassistYes:bonetreatYes     -4.579217e-01
smokeYes:rateriskSame          5.156104e-01
smokeYes:rateriskGreater      -3.715790e-02
smokeYes:fracscore            -8.246239e-02
smokeYes:bonemedYes            2.001411e+00
smokeYes:bonemed_fuYes         1.364922e+00
smokeYes:bonetreatYes          1.364428e+00
rateriskSame:fracscore        -3.632381e-02
rateriskGreater:fracscore     -1.890310e-03
rateriskSame:bonemedYes        5.108128e-01
rateriskGreater:bonemedYes     .           
rateriskSame:bonemed_fuYes     4.206179e-01
rateriskGreater:bonemed_fuYes  4.426142e-01
rateriskSame:bonetreatYes      4.621957e-02
rateriskGreater:bonetreatYes   .           
fracscore:bonemedYes           9.165401e-02
fracscore:bonemed_fuYes        .           
fracscore:bonetreatYes        -2.636760e-02
bonemedYes:bonemed_fuYes      -1.915447e-01
bonemedYes:bonetreatYes       -1.901355e-01
bonemed_fuYes:bonetreatYes    -1.868474e-01
(glmnet coefficient path plot not preserved in this export)

glmnet removed many of the polynomial and interaction terms, but the final model still contains far too many variables to be interpretable and is much more complex than the previous logistic regression model.
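The zeroed-out entries (the dots in the sparse matrix above) are how the elastic-net penalty performs feature selection. A self-contained sketch on synthetic data (hypothetical variable setup, not the study data):

```r
library(glmnet)
set.seed(1234)
# 20 candidate predictors, only the first 3 truly matter
X <- matrix(rnorm(200 * 20), nrow = 200)
y <- rbinom(200, 1, plogis(X[, 1] + X[, 2] - X[, 3]))
# Cross-validated elastic net; weak coefficients shrink exactly to zero
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)
beta  <- coef(cvfit, s = "lambda.1se")
sum(beta != 0) - 1   # predictors retained (intercept excluded)
```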

Error Metrics¶

The error metrics below pertain to how the complex logistic regression model performs on the test data.

In [36]:
library(pROC)
# Predicting probabilities on the test data
complex.probs <- predict(complex.fit, test, type = "prob")$Yes
complex.roc <- roc(response = test$fracture, predictor = complex.probs, levels = c("No", "Yes"))


plot(complex.roc,print.thres="best",col="red")
# plot(glmnet.roc,add=T,col="lightblue")
legend("bottomright",
       legend=c("Simple", "Complex","GLMNET"),
       col=c("black", "red","lightblue"),
       lwd=4, cex =1, xpd = TRUE, horiz = FALSE)
auc(complex.roc)

# Confusion Matrix
# complex.probs <- predict(complex.fit, test, type = "raw") #Not needed for caret
complex.pred <- ifelse(complex.probs > 0.320, "Yes", "No")
complex.conf <- confusionMatrix(factor(complex.pred), factor(test$fracture), positive = "Yes")
complex.conf
Setting direction: controls < cases

0.695704633204633
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  82  19
       Yes 30  18
                                          
               Accuracy : 0.6711          
                 95% CI : (0.5895, 0.7458)
    No Information Rate : 0.7517          
    P-Value [Acc > NIR] : 0.9895          
                                          
                  Kappa : 0.1988          
                                          
 Mcnemar's Test P-Value : 0.1531          
                                          
            Sensitivity : 0.4865          
            Specificity : 0.7321          
         Pos Pred Value : 0.3750          
         Neg Pred Value : 0.8119          
             Prevalence : 0.2483          
         Detection Rate : 0.1208          
   Detection Prevalence : 0.3221          
      Balanced Accuracy : 0.6093          
                                          
       'Positive' Class : Yes             
                                          
(ROC curve plot not preserved in this export)

The complex logistic regression model has an AUC of 0.70, with the ROC-optimal threshold at 0.18. Evaluated at the same 0.32 threshold used for the simple model, the confusion matrix shows a sensitivity of 0.49 and a specificity of 0.73. With respect to the metrics we have chosen, the complex model performs the same as or worse than the interpretable model. This indicates either that the data are not complex enough to warrant a more complicated model, or that the complex model is overfitting.

LDA / QDA¶

In [37]:
library(aplore3)
library(ggplot2)

mydata<-glow_bonemed
head(mydata)
names(mydata)

#?glow_bonemed

library(caret)
fitControl<-trainControl(method="repeatedcv",number=5,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)

mydata_clean = subset(mydata, select = -c(sub_id,site_id,phy_id,bmi))

names(mydata)
names(mydata_clean)

# Split into train / test using caret
set.seed(123)
trainIndex <- createDataPartition(mydata_clean$fracture, p = .7, list = FALSE)
train <- mydata_clean[trainIndex, ]
test <- mydata_clean[-trainIndex, ]

#LDA
lda.fit<-train(fracture~.,
               data=train,
               method="lda",
               trControl=fitControl,
               metric="logLoss")


#Computing predicted probabilities on the test data
predictions <- predict(lda.fit, test, type = "prob")[,"Yes"]

#Getting confusion matrix
threshold=0.177
lda.preds<-factor(ifelse(predictions>threshold,"Yes","No"),levels=c("Yes","No"))
lda.conf <- confusionMatrix(data = lda.preds, reference = test$fracture, mode = "everything")

#QDA
qda.fit<-train(fracture~.,
               data=train,
               method="qda",
               trControl=fitControl,
               metric="logLoss")


#Computing predicted probabilities on the test data
predictions <- predict(qda.fit, test, type = "prob")[,"Yes"]

#Getting confusion matrix
threshold=0.096
qda.preds<-factor(ifelse(predictions>threshold,"Yes","No"),levels=c("Yes","No"))
confusionMatrix(data = qda.preds, reference = test$fracture, mode = "everything")


library(pROC)
#Predicting probabilities on the test data
lda.predprobs <- predict(lda.fit, test, type="prob")
lda.roc<-roc(response=test$fracture,predictor=lda.predprobs[,2],levels=c("Yes","No"))

qda.predprobs <- predict(qda.fit, test, type="prob")
qda.roc<-roc(response=test$fracture,predictor=qda.predprobs[,2],levels=c("Yes","No"))


plot(lda.roc, print.thres="best", col="lightblue")
#plot(qda.roc, print.thres="best", col="red", add=T)
legend("bottomright",
       legend=c("LDA"),
       col=c("lightblue"),
       lwd=4, cex =1, xpd = TRUE, horiz = FALSE)


auc(lda.roc)
auc(qda.roc)
A data.frame: 6 × 18

  sub_id site_id phy_id priorfrac age weight height      bmi premeno momfrac armassist smoke raterisk fracscore fracture bonemed bonemed_fu bonetreat
1      1       1     14        No  62   70.3    158 28.16055      No      No        No    No     Same         1       No      No         No        No
2      2       4    284        No  65   87.1    160 34.02344      No      No        No    No     Same         2       No      No         No        No
3      3       6    305       Yes  88   50.8    157 20.60936      No     Yes       Yes    No     Less        11       No      No         No        No
4      4       6    309        No  82   62.1    160 24.25781      No      No        No    No     Less         5       No      No         No        No
5      5       1     37        No  61   68.0    152 29.43213      No      No        No    No     Same         1       No      No         No        No
6      6       5    299       Yes  67   68.0    161 26.23356      No      No        No   Yes     Same         4       No      No         No        No
  1. 'sub_id'
  2. 'site_id'
  3. 'phy_id'
  4. 'priorfrac'
  5. 'age'
  6. 'weight'
  7. 'height'
  8. 'bmi'
  9. 'premeno'
  10. 'momfrac'
  11. 'armassist'
  12. 'smoke'
  13. 'raterisk'
  14. 'fracscore'
  15. 'fracture'
  16. 'bonemed'
  17. 'bonemed_fu'
  18. 'bonetreat'
  1. 'priorfrac'
  2. 'age'
  3. 'weight'
  4. 'height'
  5. 'premeno'
  6. 'momfrac'
  7. 'armassist'
  8. 'smoke'
  9. 'raterisk'
  10. 'fracscore'
  11. 'fracture'
  12. 'bonemed'
  13. 'bonemed_fu'
  14. 'bonetreat'
Warning message in confusionMatrix.default(data = lda.preds, reference = test$fracture, :
“Levels are not in the same order for reference and data. Refactoring data to match.”
Warning message in confusionMatrix.default(data = qda.preds, reference = test$fracture, :
“Levels are not in the same order for reference and data. Refactoring data to match.”
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  77  13
       Yes 35  24
                                         
               Accuracy : 0.6779         
                 95% CI : (0.5965, 0.752)
    No Information Rate : 0.7517         
    P-Value [Acc > NIR] : 0.983414       
                                         
                  Kappa : 0.2803         
                                         
 Mcnemar's Test P-Value : 0.002437       
                                         
            Sensitivity : 0.6875         
            Specificity : 0.6486         
         Pos Pred Value : 0.8556         
         Neg Pred Value : 0.4068         
              Precision : 0.8556         
                 Recall : 0.6875         
                     F1 : 0.7624         
             Prevalence : 0.7517         
         Detection Rate : 0.5168         
   Detection Prevalence : 0.6040         
      Balanced Accuracy : 0.6681         
                                         
       'Positive' Class : No             
                                         
Setting direction: controls > cases

Setting direction: controls > cases

0.688947876447876
0.652027027027027
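
The AUC values above can be read as the probability that a randomly chosen fracture case receives a higher predicted risk than a randomly chosen non-case. A base-R sketch of that rank-based (Mann-Whitney) interpretation, on hypothetical scores rather than the fitted models' output:

```r
# AUC as a rank statistic: P(score of a random "Yes" > score of a random "No"),
# counting ties as 1/2. Scores here are hypothetical, not the fitted models' output.
auc_rank <- function(scores, labels) {
  pos <- scores[labels == "Yes"]
  neg <- scores[labels == "No"]
  # average over all (case, control) pairs
  mean(outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n)))
}

scores <- c(0.9, 0.4, 0.6, 0.7, 0.3, 0.1)
labels <- c("Yes", "No", "Yes", "No", "Yes", "No")
auc_rank(scores, labels)  # 6 of the 9 case/control pairs are correctly ordered: 2/3
```

pROC's `auc()` integrates the ROC curve by the trapezoidal rule, which agrees with this pairwise count when ties are handled as 1/2.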
[Figure: ROC curve for the LDA model, with the best (Youden) threshold marked]

Nonparametric¶

In [38]:
bone2 = subset(bone, select = -c(fracture.num))

set.seed(123)
trainIndex <- createDataPartition(bone2$fracture, p = .7, list = FALSE)
train <- bone2[trainIndex, ]
test <- bone2[-trainIndex, ]
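`createDataPartition()` samples within each level of the outcome so that train and test keep roughly the same fracture prevalence as the full data. A base-R sketch of that stratified split, on a toy 75/25 outcome vector that is illustrative only, not the study data:

```r
# Stratified 70/30 split in the spirit of createDataPartition():
# sample row indices within each outcome class so prevalence is preserved.
# The 75/25 toy outcome below is illustrative, not the study data.
set.seed(123)
y <- factor(rep(c("No", "Yes"), times = c(75, 25)))
idx <- unlist(lapply(split(seq_along(y), y),
                     function(i) sample(i, size = floor(0.7 * length(i)))))
train_y <- y[idx]
test_y  <- y[-idx]
prop.table(table(train_y))  # close to the full-data 0.75 / 0.25 mix
```

A plain random split would usually land near the same prevalence, but stratifying guarantees it, which matters for a minority class like fracture.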
In [39]:
# Define the grid of hyperparameters to search
tune_grid <- expand.grid(kmax = 1:20, distance = 1:3, kernel = 'optimal')

# Train the KNN model with the hyperparameter grid
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_model <- train(fracture ~ ., data = train, method = "kknn", trControl = control, tuneGrid = tune_grid)


print(knn_model)
# Print the best model
print(knn_model$bestTune)
k-Nearest Neighbors 

351 samples
 17 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 316, 316, 316, 316, 315, 315, ... 
Resampling results across tuning parameters:

  kmax  distance  Accuracy   Kappa    
   1    1         0.8270682  0.5149114
   1    2         0.8307703  0.5188972
   1    3         0.8175179  0.4659590
   2    1         0.8270682  0.5149114
   2    2         0.8307703  0.5188972
   2    3         0.8175179  0.4659590
   3    1         0.8270682  0.5149114
   3    2         0.8307703  0.5188972
   3    3         0.8175179  0.4659590
   4    1         0.8270682  0.5149114
   4    2         0.8307703  0.5188972
   4    3         0.8175179  0.4659590
   5    1         0.8394009  0.5266385
   5    2         0.8317756  0.5173301
   5    3         0.8204015  0.4677340
   6    1         0.8375241  0.5164199
   6    2         0.8288904  0.5061824
   6    3         0.8222782  0.4681892
   7    1         0.8412543  0.5233755
   7    2         0.8231762  0.4833204
   7    3         0.8213523  0.4632185
   8    1         0.8412543  0.5227428
   8    2         0.8241285  0.4853435
   8    3         0.8194476  0.4571262
   9    1         0.8412543  0.5227428
   9    2         0.8250809  0.4855539
   9    3         0.8223327  0.4611573
  10    1         0.8412543  0.5227428
  10    2         0.8250809  0.4855539
  10    3         0.8223327  0.4611573
  11    1         0.8412543  0.5227428
  11    2         0.8250809  0.4855539
  11    3         0.8223327  0.4611573
  12    1         0.8412543  0.5227428
  12    2         0.8250809  0.4817031
  12    3         0.8223327  0.4611573
  13    1         0.8412543  0.5227428
  13    2         0.8280221  0.4861155
  13    3         0.8195020  0.4489249
  14    1         0.8412543  0.5227428
  14    2         0.8280221  0.4861155
  14    3         0.8185761  0.4441881
  15    1         0.8412543  0.5227428
  15    2         0.8280221  0.4861155
  15    3         0.8185761  0.4441881
  16    1         0.8412543  0.5227428
  16    2         0.8280221  0.4861155
  16    3         0.8185761  0.4441881
  17    1         0.8412543  0.5227428
  17    2         0.8280221  0.4861155
  17    3         0.8185761  0.4441881
  18    1         0.8412543  0.5227428
  18    2         0.8280221  0.4861155
  18    3         0.8185761  0.4441881
  19    1         0.8412543  0.5227428
  19    2         0.8280221  0.4861155
  19    3         0.8185761  0.4441881
  20    1         0.8412543  0.5227428
  20    2         0.8280221  0.4861155
  20    3         0.8185761  0.4441881

Tuning parameter 'kernel' was held constant at a value of optimal
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were kmax = 20, distance = 1 and kernel
 = optimal.
   kmax distance  kernel
58   20        1 optimal
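
At its core, the kknn method classifies a point by a vote of its nearest training neighbors; the grid above tunes the neighborhood size (kmax) and the Minkowski distance power. A minimal unweighted base-R sketch on toy 2-D data (caret's "kknn" additionally applies kernel weighting to the votes):

```r
# Minimal unweighted k-nearest-neighbor classifier in base R (a toy sketch;
# caret's "kknn" additionally kernel-weights votes and tunes the distance power)
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  apply(new_x, 1, function(q) {
    d <- sqrt(colSums((t(train_x) - q)^2))  # Euclidean distance to each training point
    nn <- train_y[order(d)[seq_len(k)]]     # labels of the k nearest neighbors
    names(which.max(table(nn)))             # majority vote
  })
}

# Two well-separated toy clusters
train_x <- matrix(c(0, 0,  0, 1,  1, 0,  5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
train_y <- c("No", "No", "No", "Yes", "Yes", "Yes")
new_x   <- matrix(c(0.5, 0.5,  5.5, 5.5), ncol = 2, byrow = TRUE)
knn_predict(train_x, train_y, new_x)  # "No" "Yes"
```

Because KNN is distance based, feature scaling matters; the caret fit above passes no preProcess step, so the model sees the raw units.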
In [40]:
# KNN model training control
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
tune_grid <- expand.grid(kmax = 20, distance = 1, kernel = 'optimal')
knn_model <- train(fracture ~ ., data = train, method = "kknn", trControl = control, tuneGrid = tune_grid)

#Predictions on test
knn_probs <- predict(knn_model, test, type = "prob")$Yes
knn_roc <- roc(response = test$fracture, predictor = knn_probs, levels = c("No", "Yes"))

plot(knn_roc, print.thres = "best", col = "red")
legend("bottomright", legend = c("KNN"), col = c("red"), lwd = 4, cex = 1, xpd = TRUE, horiz = FALSE)

auc(knn_roc)

# Confusion matrix from the class predictions
knn_preds <- predict(knn_model, test, type = "raw")
knn_conf <- confusionMatrix(knn_preds, test$fracture, positive = "Yes")
knn_conf
Setting direction: controls < cases

0.911438223938224
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  107  13
       Yes   5  24
                                          
               Accuracy : 0.8792          
                 95% CI : (0.8158, 0.9268)
    No Information Rate : 0.7517          
    P-Value [Acc > NIR] : 8.64e-05        
                                          
                  Kappa : 0.6511          
                                          
 Mcnemar's Test P-Value : 0.09896         
                                          
            Sensitivity : 0.6486          
            Specificity : 0.9554          
         Pos Pred Value : 0.8276          
         Neg Pred Value : 0.8917          
             Prevalence : 0.2483          
         Detection Rate : 0.1611          
   Detection Prevalence : 0.1946          
      Balanced Accuracy : 0.8020          
                                          
       'Positive' Class : Yes             
                                          
[Figure: ROC curve for the KNN model, with the best (Youden) threshold marked]
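
The "best" point that `print.thres = "best"` marks on these ROC curves is the cutoff maximizing Youden's J statistic, sensitivity + specificity - 1. A base-R sketch on hypothetical scores:

```r
# Youden's J: pick the cutoff maximizing sensitivity + specificity - 1.
# Scores below are hypothetical, not the fitted models' output.
youden_best <- function(scores, labels) {
  cuts <- sort(unique(scores))
  j <- sapply(cuts, function(t) {
    sens <- mean(scores[labels == "Yes"] >= t)  # cases at or above the cutoff
    spec <- mean(scores[labels == "No"] < t)    # controls below the cutoff
    sens + spec - 1
  })
  cuts[which.max(j)]
}

scores <- c(0.1, 0.2, 0.35, 0.4, 0.6, 0.8)
labels <- c("No", "No", "No", "Yes", "Yes", "Yes")
youden_best(scores, labels)  # 0.4 separates the two classes perfectly
```

This criterion weights sensitivity and specificity equally; a cost-sensitive application (such as missing a fracture) might deliberately prefer a lower cutoff, as in the 0.096 threshold used earlier.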

Model Comparisons (ROC Curves)¶

In [41]:
# Create AUC strings for legend
auc_values <- c(
  paste("Simple (AUC =", round(auc(logit.roc), 3), ")"),
  paste("Complex (AUC =", round(auc(complex.roc), 3), ")"),
  paste("LDA (AUC =", round(auc(lda.roc), 3), ")"),
  paste("KNN (AUC =", round(auc(knn_roc), 3), ")")
)

plot(logit.roc, print.thres = "best", col = "red")
plot(complex.roc, print.thres = "best", add = TRUE, col = "lightblue")
plot(lda.roc, print.thres = "best", add = TRUE, col = "green")
plot(knn_roc, print.thres = "best", add = TRUE, col = "orange")

# Updated legend
legend("bottomright",
       legend = auc_values,   # Use the updated AUC values for legend
       col = c("red", "lightblue", "green", "orange"),
       lwd = 4, 
       cex = 1,
       xpd = TRUE,
       horiz = FALSE
)

# AUROC
# Create a data frame
auc_df <- data.frame(
  Model = c("Simple Logistic Regression", "Complex Logistic Regression","LDA", "KNN"),
  AUC = c(auc(logit.roc), auc(complex.roc),auc(lda.roc), auc(knn_roc))
)
# Print the data frame
auc_df %>%
    arrange(desc(AUC)) %>%
    kable(digits = 3) %>%
    kable_styling(bootstrap_options = "striped", full_width = F)
Model                          AUC
KNN                          0.911
Complex Logistic Regression  0.696
Simple Logistic Regression   0.694
LDA                          0.689
[Figure: Overlaid ROC curves for the simple logistic, complex logistic, LDA, and KNN models, with best thresholds marked]

Conclusion¶

In [42]:
# Error Metrics
  # Sensitivity, Specificity, Prevalence, PPV, NPV, and AUROC

print('Simple Logistic Regression: ')
logit.conf$byClass
auc(logit.roc)

print('Complex Logistic Regression: ')
complex.conf$byClass
auc(complex.roc)

print('LDA: ')
lda.conf$byClass
auc(lda.roc)

print('KNN: ')
knn_conf$byClass
auc(knn_roc)
Per-model test-set metrics (positive class "Yes", except LDA, which caret scored with positive class "No" after refactoring the prediction levels):

Metric                 Simple LR   Complex LR    LDA*     KNN
Sensitivity                0.541        0.486   0.518   0.649
Specificity                0.795        0.732   0.811   0.955
Pos Pred Value             0.465        0.375   0.892   0.828
Neg Pred Value             0.840        0.812   0.357   0.892
Precision                  0.465        0.375   0.892   0.828
Recall                     0.541        0.486   0.518   0.649
F1                         0.500        0.424   0.655   0.727
Prevalence                 0.248        0.248   0.752   0.248
Detection Rate             0.134        0.121   0.389   0.161
Detection Prevalence       0.289        0.322   0.436   0.195
Balanced Accuracy          0.668        0.609   0.664   0.802
AUC                        0.694        0.696   0.689   0.911

* LDA metrics use "No" as the positive class, so its PPV/NPV and prevalence are not directly comparable to the other models.